
Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models

Kim, Dahyun, Lee, Sukyung, Kim, Yungi, Rutherford, Attapol, Park, Chanjun

arXiv.org Artificial Intelligence

The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks that assess their core capabilities, such as reasoning, knowledge, and commonsense, leading to the inception of widely-used benchmark suites such as the H6 benchmark. However, these benchmark suites are built primarily for English, and comparable suites are lacking for languages that are under-represented in LLM development, such as Thai. At the same time, developing LLMs for Thai should enhance cultural understanding as well as core capabilities. To address this dual challenge in Thai LLM research, we propose two key benchmarks: Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a thorough evaluation of various LLMs with multi-lingual capabilities, we provide a comprehensive analysis of the proposed benchmarks and how they contribute to Thai LLM development. Furthermore, we will make both the datasets and evaluation code publicly available to encourage further research and development for Thai LLMs.


AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian

Puccetti, Giovanni, Rogers, Anna, Alzetta, Chiara, Dell'Orletta, Felice, Esuli, Andrea

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real "content farm". We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
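The log-likelihood detector mentioned in the abstract can be sketched in a few lines: score a text by its average per-token log-probability under a language model and flag texts that score suspiciously high, since model-generated text tends to be more probable under the scoring model than human-written text. This is a minimal illustration only; it substitutes a toy unigram model for a real LLM, and the function names and threshold are illustrative, not the paper's implementation.

```python
import math

def avg_log_likelihood(tokens, model_probs, unk_prob=1e-6):
    """Mean per-token log-probability of `tokens` under a toy
    unigram model given as a {token: probability} dict. Unknown
    tokens fall back to a small floor probability."""
    return sum(math.log(model_probs.get(t, unk_prob)) for t in tokens) / len(tokens)

def looks_synthetic(tokens, model_probs, threshold=-6.0):
    """Flag a text as likely machine-generated when its average
    log-likelihood exceeds a tuned threshold. Real detectors score
    with the (suspected) generating LLM's token likelihoods, which
    is exactly the access requirement the paper calls impractical."""
    return avg_log_likelihood(tokens, model_probs) > threshold
```

The threshold would in practice be calibrated on held-out human and synthetic texts; the paper's point is that such calibration data, and likelihood access itself, are rarely available "in the wild".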


Segmentation-free Connectionist Temporal Classification loss based OCR Model for Text Captcha Classification

Khatavkar, Vaibhav, Velankar, Makarand, Petkar, Sneha

arXiv.org Artificial Intelligence

Captchas are widely used to secure systems against automated responses by distinguishing computer responses from human ones. Text, audio, video, and picture-based captchas are all in use, with text-based captchas, typically attacked via Optical Character Recognition (OCR), being the most common; their main difficulties are complex and distorted content. There have been attempts to build captcha detection and classification systems using machine learning and neural networks, which need to be tuned for accuracy. Existing systems face challenges in recognising distorted characters, handling variable-length captchas, and capturing sequential dependencies within a captcha. In this work, we propose a segmentation-free OCR model for text captcha classification based on the connectionist temporal classification (CTC) loss technique. The proposed model is trained and tested on a publicly available captcha dataset, achieving 99.80% character-level accuracy and 95% word-level accuracy. Compared against state-of-the-art models, the proposed model proves effective. Variable-length, complex captchas can thus be processed with the segmentation-free CTC loss technique, which models sequential dependencies and can be applied broadly to securing software systems.
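The "segmentation-free" property comes from how CTC decodes per-frame network outputs into a character sequence without ever locating character boundaries: repeated labels are merged, then blank labels are dropped. A minimal sketch of this standard best-path (greedy) decoding step, with illustrative names and not the authors' code:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a sequence of per-frame argmax labels into an output
    label sequence: merge consecutive repeats, then remove blanks.
    Because the mapping works for any input length, variable-length
    captchas need no character segmentation."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# e.g. frames [0,1,1,0,2,2,2,0,1] (0 = blank) collapse to [1, 2, 1]
```

During training, the CTC loss sums the probability of every frame-level alignment that collapses to the target string, which is what lets the network learn from (image, text) pairs alone.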


IoMT Technology Automates Vital Signs Measurement

#artificialintelligence

No one likes to schedule a medical appointment only to find an endless wait at a crowded doctor's office or clinic. But with a critical lack of healthcare workers, those waits aren't getting any shorter. The good news is IoMT (Internet of Medical Things) technology is helping take the pressure off overburdened staff. Self-service kiosks, powered by AI, can deliver a better patient experience--both in and out of the clinical setting. The shortage of medical workers may be new in some parts of the world, but it's a familiar problem in other markets.


New facial recognition technology caught 'imposter' using someone else's passport, US officials say

The Independent - Tech

A new facial recognition technology caught a man trying to enter the US using a passport belonging to someone else, US officials say. Officials with US Customs and Border Protection (CBP) and the Office of Field Operations (OFO) intercepted a 26-year-old man, whom the agencies referred to as an "imposter", who reportedly attempted to use a French passport belonging to someone else at Washington's Dulles International Airport. The man was travelling to the US from Brazil. "The officer utilised CBP's new facial comparison biometric technology which confirmed the man was not a match to the passport he presented," the CBP press release read. It added: "A search revealed the man's authentic Republic of Congo identification card concealed in his shoe."


Sarah Jeong: New York Times journalist who tweeted 'cancel white people' is victim of 'dishonest' trolls, claims former employer

The Independent - Tech

Sarah Jeong, a technology journalist hired by the New York Times and vilified online for tweets comparing "dumbass f****** white people" to dogs and saying they would "all go extinct soon", has been targeted for harassment by dishonest trolls, her former employer has claimed. Editors at The Verge, an online tech magazine, denounced what they called "disingenuous" criticism of Ms Jeong by "people acting in bad faith". The senior writer had been the victim of a Gamergate-style campaign designed to "divide and conquer by forcing newsrooms to disavow their colleagues", they suggested. Ms Jeong, 30, posted a string of offensive and apparently racist messages including "#CancelWhitePeople" and "white men are bulls***" up to five years ago. After being uncovered they quickly spread and were picked up by conservative media including the Daily Caller and Gateway Pundit websites.